Two-sample survival tests based on control arm summary statistics

The one-sample log-rank test is the preferred method for analysing the outcome of single-arm survival trials. It compares the survival distribution of patients with a prefixed reference survival curve that usually represents the expected outcome under standard of care. However, classical one-sample log-rank tests assume that the reference curve is known, ignoring that it is frequently estimated from historical data and therefore susceptible to sampling error. Neglecting the variability of the reference curve can lead to an inflated type I error rate, as shown in a previous paper. Here, we propose a new survival test that allows to account for the sampling error of the reference curve without knowledge of the full underlying historical survival time data. Our new test allows to perform a valid historical comparison of patient survival times when only a historical survival curve rather than the full historic data is available. It thus applies in settings where the two-sample log-rank test is not applicable as method of choice due to non-availability of historic individual patient survival time data. We develop sample size calculation formulas, give an example application and study the performance of the new test in a simulation study.


Introduction
The one-sample log-rank test [1] is the method of choice for single-arm Phase II trials with time-to-event endpoint.It allows to compare the survival of the patients to a prefixed reference survival curve that typically represents the expected survival under standard of care.
A central point of criticism of the one-sample log-rank test relates to the process of selecting the reference survival curve.It is common practice to choose the reference survival curve in the light of historic data on standard treatment.This implies that choice of the reference survival curve itself is prone to statistical error, which however, is ignored in the classical onesample log-rank statistic.In classical log-rank tests the prefixed reference curve is rather treated as deterministic.Doing so results in an inflation of the type I error rate, when in fact the prefixed reference curve resulted from an estimation process [2].
For this reason, it is standard practice to use the two-sample log-rank test to compare patient survival with a historical control group whenever historical survival data are available.However, situations may arise in which the full historical survival data on individual patient level is not (or no longer) available to the researcher.This may occur for a variety of reasons, ranging from copyright to privacy reasons, such as when data sharing is not covered by informed consent or individual patients withdrew their consent.In such a situation, generally only summary statistics such as a survival distribution estimator based on historical survival times together with basic data on sample size, event numbers, and length of follow-up will be available in peer-reviewed scientific publications.Then, application of the two-sample logrank test is not possible because the required complete historical survival data on individual patient level are not available.Conversely, it is also not permissible to use the published survival estimator as deterministic reference curve in a classical one-sample log-rank test, since this ignores the sampling error of the reference curve, leading to inflation of the type I error rate [2].
In this paper, we present a solution to this problem.We propose a new survival test that, under mild assumptions, allows the sampling error of the reference curve to be taken into account without knowledge of the full historical survival time data.This allows to perform a valid historical comparison of patient survival times when only an estimator of the historical survival curve is available instead of the full historical survival data.Our proposed test can formally also be interpreted as a two-sample test for survival times, and can thus be compared with the two-sample log-rank test in terms of test performance.However, the intended application focus of our new test is comparison of survival against a historical reference in the absence of the full historical survival time data.
The paper is organized as follows.After settling notation and the testing problem, we describe the test statistic and its distributional properties.We then will provide an example application on real world data.Additionally, we provide sample size calculation methodology.Calculation of rejection regions and sample size are based on the approximate distribution of the new test statistic in the large sample limit.Therefore small sample properties of the new test regarding type I and type II error rate control are studied by simulation, and compared to the two-sample log-rank test.We conclude with a discussion of future research.

Notation
We consider a single-arm clinical trial with survival data from a treatment group B, say (experimental intervention, prospectively collected data).This survival data shall be tested against the survival under standard of care treatment A, say represented by the reference survival curve S T A .However, in practice the true survival curve under standard of care S T A is unknown and thus researchers may rely on an estimator of the reference survival curve b S T A taken e.g. from a publication in a peer reviewed journal.We denote with b L A a corresponding estimator of the cumulative hazard either taken from the literature or obtained through the equation b L A ¼ À logð b S T A Þ.These estimators stem from a group of historical patients who were treated under standard of care.However, the individual data on patient level are usually not available for the public due to data protection and privacy reasons.
Let N B denote the set of patients from group B, n B jN B ≔ j the number of such patients.We denote by T B,i and C B,i the time from entry to event or censoring for patient i 2 N B respectively.Let X B,i ≔ T B,i ^CB,i denote the minimum of both.As usual, we assume that the T B,i and C B,i are mutually independent (non-informative censoring).
('Bundesdatenschutzgesetz', BDSG).Therefore, they are stored on a secure drive in the WIdO, to facilitate replication of the results.Generally, access to data of statutory health insurance funds for research purposes is possible only under the conditions defined in German Social Law (SGB V § 287).Requests for data access can be sent as a formal proposal specifying the recipient and purpose of the data transfer to the appropriate data protection agency.Access to the data used in this study can only be provided to external parties under the conditions of the cooperation contract of this research project and after written approval by the sickness fund.For assistance in obtaining access to the data, please contact wido@wido.bv.aok.de.

Funding:
The work of MFD was funded by the German Science Foundation (Deutsche Forschungsgemeinschaft, DFG, https://www.dfg.de, grant number 413730122).The study was conducted within the framework of the GenderVasc project (Gender-specific real care situation of patients with arteriosclerotic cardiovascular diseases) funded by The Federal Joint Committee, Innovation Committee (G-BA, Innovationsfond, number 01VSF18051).GenderVasc is a cooperation project with the AOK Research Institute (WIdO).The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests:
The authors have declared that no competing interests exist.
Based on the observed data, we calculate the number of events from treatment group B up to study time s � 0 as As usual, we let λ B (s) ≔ lim Δ!0 P(s � T B,i < s + Δ|T B,i � s)/Δ denote the true hazard of a patient i 2 N B and denote by L B ðsÞ ≔ R s 0 l B ðuÞ du the corresponding cumulative hazard function for s � 0, respectively.
Finally, we denote by F T B and S T B the distribution function and survival function of the time to event T B,i and denote with f C B the density function of the censoring time C B,i for i 2 N B .We assume the tuples (C B,i , T B,i ) i to be independent and identically distributed and assume C B,i and T B,i to be independent for each i 2 N B .We denote by a B the recruitment period length, with f B the follow up period length resulting in a trial period length of a B + f B for the new intervention trial.In particular, we assume that patients enter the trial between calendar time 0 and a B , and are then further followed-up until calendar time a B + f B , unless they are not censored or experience an event.Therefore the intended follow up of the first patient is a B + f B and of the last patient is f B respectively.Furthermore we denote with r B ≔ n B /a B the recruitment rate of the new trial.
The above definitions shall be applied analogously to group A. In particular, we denote by π = n B /n A the ratio of the group sizes.However, notice again that usually the individual data X A,i , T A,i and C A,i for i 2 N A are unknown to the researcher and only the survival curve estimator b S T A with summary statistics about number at risk and sample size are available.

The testing procedure
We are interested in testing H 0 : L B ðsÞ ¼ L A ðsÞ for all s 2 ½0; s max �; for some prefixed time-horizon s max > 0, i.e. that the survival of the patients under the new experimental treatment coincides with survival under standard of care.The classical one-sample log-rank test could be used to test H 0 if the reference cumulative hazard function Λ A was known.However, the practical user faces the problem that the true reference curve Λ A representing survival under standard of care is unknown.For this reason Λ A is often replaced within all test statistics by its Nelson-Aalen estimate b L A derived from historic data.This means that the classical one-sample log-rank test of H 0 is based on the statistic , where F −1 is the standard normal quantile function.This proceeding would only be admissible if the variance of b L A was negligible.In practice, however, Varð b L A Þ is typically not negligible [2].Then, using the process b M 0 with the classical standardisation b S 2 OSLR leads to an under-estimation of variance and thus inflation of the type I error rate.The factor of under-estimation of variance may be approximated by Then, an approximate level α test of H 0 is defined by rejecting H 0 whenever where α is the desired two-sided significance level.Notice again that we used the assumption of equal recruitment and censoring mechanisms in the new experimental and historic control cohort.We will keep this assumption for the rest of the manuscript and will investigate how deviation from this assumption affects the type I error rate control within simulation studies.Furthermore, notice that procedure ( 5) is equivalent to adjusting the nominal α-level of the classical one-sample log-rank test methodology to or analogously using the critical boundary Notice that in theory one could use a corrected variance estimator b S 2 for Varð b M 0 ðsÞÞ (see Eq. ( 5) of [2]) instead of our approximation b S 2 OSLR ðs max Þ � ð1 þ pÞ.However, using the statistic Z π yields some advantages.Firstly, calculation of Z π can be executed with existing standard software whereas b S 2 needs additional implementation effort to be calculated.Secondly, for computation of b S 2 the full dataset of the historical patients is needed, whereas Z π is based only on the sample size n A and the estimated reference curve b L A [3] (or equivalently the Kaplan Meier estimator b S A [4]).This enables the usage of the Z π -test (5) even in absence of full historical survival time data.

A real world example
Our hypothetical example stems from clinical cardiology and originates from the context of the 'obesity paradox'.This paradox refers to the observation that in certain cardiovascular diseases, overweight patients may exhibit a more favourable prognosis compared to non-obese patients.This phenomenon has been observed in several cardiovascular diseases, including heart failure, coronary heart disease and acute myocardial infarction.Despite the well-established link between obesity and an increased risk of cardiovascular disease, studies have found that obese individuals particularly those with pre-existing cardiovascular disease, may have a lower mortality risk compared to non-obese patients.This observation has generated considerable interest and debate in the medical community.In this context it might be interesting to test the null hypothesis H 0 that the survival of obese patients suffering from acute myocardial infarction coincides with the survival of non-obese patients suffering acute myocardial infarction against a significance level of α = 5%.
For researchers it could be beneficial to use reference survival curves published by large registries or obtained from health insurance data, as these could serve as a suitable option for representing the general population.
However, in our experience, when health insurance companies grant researchers access to their insurance data they strictly prohibit to extract individual patient level data.Thus our method to make a comparison based upon summary statistics is needed.The following survival curve in It includes all patients hospitalized with a myocardial infarction in 2014 aged 30-60, without obesity and without a prior indication of chronic heart failure.For our hypothetical example, this survival curve should take on the role of the historically estimated reference curve b S T A on which the access to the underlying patient level data X A,i , I(T A,i � C A,i ) for i 2 N A may be restricted to the researchers.
We used the WebPlotDigitizer website (https://apps.automeris.io/wpd/) to extract the x-y data at 481 points from the obtained image, which we interpolated to reconstruct the reference survival function b S T A to finally calculate the reference cumulative hazard function b If the classical one-sample log-rank test ( 2) is used without considering the sampling variability of the reference curve, the test would yield a p-value of p OSLR = 0.0320, yielding a significant difference between the two groups.However using our new technique (4) with π = 2632/ 10061 we obtain a p-value of p π = 0.0562 being non significant.
Another approach to test the two groups would be to use a reconstruction algorithm [5] to obtain reconstructed censored survival times XA;i for i 2 N A on individual patient level of the reference cohort to plug into the classical two-sample log-rank test.This approach would lead to a p-value of p TSR = 0.0449, yielding statistical significance once again.However, this procedure can only be used with great caution.On one hand, reconstruction algorithms deliver solid results when used to reproduce Kaplan Meier point estimators since they are designed to reproduce these as accurately as possible.However, using the reconstructed data for estimating hazard ratios or plugging them into a two-sample log-rank test may lead to poor results, as small, but systemic, deviations that are of little relevance for point estimators may add up over the course of the tested time frame [5].These deviations are not accounted for by the methodology of the two-sample log-rank test and may endanger the correct type I error rate control.This effect is especially prominent when the number of extracted x-y data points from the Kaplan Meier plot image is lower than the number of individual patients data points which need to be reconstructed.In this case the algorithm is under-saturated.Finally, if researchers had both original data sets and applied a two-sample log-rank test, the resulting p-value would be p TS = 0.0503 indicating no significance.Thus, both the classical one-sample log-rank test and the two-sample log-rank test using reconstructed data resulted in anti-conservative p-values.In summary our analysis agrees with the original two-sample log-rank test which does not confirm the obesity paradox for the cohort of patients aged 30-60 suffering from acute myocardial infarction.

General formula
Testing H 0 to the level α using Z π is equivalent to testing H 0 using the classical one-sample logrank test to the adjusted level α nominal .Thus the sample size formula for the classical one-sample log-rank test from [6] immediately yields a sample size formula for the new test based on Z π .In order to achieve a power of approximately 1 − β for an allocated significance level α and constant hazard ratio ω ≔ Λ B (s)/Λ A (s) for all s � 0, the new test procedure (5) has to be performed based on a total of events [6].Notice that above equation is a direct result of plugging our adjusted significance level α nominal into the sample size formula of the classical one-sample log-rank test (see Eq. ( 4) of [6]).However the existing methodology and software can not be used directly as the adjusted significance level α nominal itself depends on the sample size through π.This number of events d can be expected to be achieved if the sample size is chosen such that Through administrative censoring at calendar time a B + f B it is clear that f C B is only nonzero on the interval [0, a B + f B ] and thus the integral limits reduce accordingly.Through fixing two of the three parameters a B , f B and r B by clinical consideration and by using the identities n B = r B � a B and π = (r B � a B )/n A as well as recognizing the functional dependency of f C B on the parameters a B and f B , one can numerically solve the above equation for the last free parameter of a B , f B and r B .We provide R-Code in the supplemental material to perform this sample size calculation.

Closed formula for rare late events
The above methodology describes a general approach for sample size calculation.However, in the clinical setting, it may be advantageous for the statistician to have a closed formula at hand.For this reason we will develop a closed formula for a special case often met in practice.
We assume the following conditions: (i): No censoring except for administrative censoring at calendar time a B + f B (ii): A plateau of the survival function at the edge of the observation window [0, s max ] such that Assumption (ii) may be approximately met when the survival curve stagnates for a long period of time.An example might be given by the survival curve of patients after a complex surgery, which sharply declines shortly after the procedure but stagnates after a year.Also, in the context of cause specific deaths such a plateau may occur naturally.
We now evaluate the power formula Under the assumptions (i) and (ii), the probability of observing an event is equal to the probability of having an event in general, and thus the expectation on the left hand side is tantamount to n B � p B , with p B ≔ 1 À S T A ðs max Þ o .We can obtain the required sample size n B by solving the above equation for for z ≔ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi . Furthermore, using the identity π = n B /n A , solving the equation above is equivalent to calculating the roots of the quadratic polynomial Notice that since the intercept and linear coefficient are negative for 1 − β > 0.5, the polynomial has a positive root if and only if the quadratic coefficient is positive e.g.
This yields a lower bound for n A .When n A falls below this threshold it is not possible to get a significant result to a power of at least 50% when sampling variability is taken into account.Notice that the assumptions (i) and (ii) were introduced, to easily approximate the probability of observing an event P(T B < C B ).However, using the inequality o , we can also obtain a lower bound for the more general sample size calculation: Furthermore if inequality (7) holds, the distinct positive root of the polynomial can be calculated and thus the required sample size n B is given by

Design
We proposed a significance test for the null hypothesis H 0 based on the approximate large sample distribution of the test statistic Z π introduced before.Notice that only the sample size n A and the graph of the Nelson Aalen estimate b L A of group A is needed to compute the test statistic Z π .Knowledge of full survival data for group A is not required.However, in case of full knowledge of the data of group A, our test with consideration of reference curve variability, may also be interpreted as a two-sample survival test.This simulation, therefore aims to study performance of the new survival test for sample sizes of practical relevance, compared to the classical twosample log-rank test, which by design takes the variability of the historical data into account (see [7,8]).Asymptotically (i.e. for sufficiently large sample size) the classical two-sample log-rank test is known to be the optimal two-sample test under proportional hazards (PH) alternatives.It is thus of particular interest to compare performance of both tests under PH alternatives.
In clinical practice the sample size calculation methodology is most likely to be used with the recruitment rate r B and follow up duration f B fixed by clinical consideration and thus solving for the accrual period length a B , resulting in the sample size n B ≔ r B � a B .However, since our approximation is based on the assumption that the censoring and recruitment mechanisms are equal to the historic cohort, we use the sample size equation to solve for r B with predefined accrual period length a B and follow up duration f B (and thus fixing the administrative censoring and recruitment mechanism independently from the sample size), to have clear insight in the performance of our test if the assumption of equal censoring and recruitment mechanisms is violated.
In our simulations, patients were assumed to enter the trial uniformly between year 0 and year a A = a B = 3 and followed up for f B 2 {2, 3, 4} in the intervention cohort and f A = 3 years in the historic cohort.
Accordingly, the calendar times of entry were generated according to a uniform distribution We considered scenarios with no loss to follow-up setting and also considered scenarios with additional independent exponentially distributed censoring C * x;i � Expðl C x Þ, where overall censoring was set as In these Scenarios we set l C A ¼ 0:20 and varied l C B 2 f0:15; 0:20; 0:25g between scenarios to analyse the robustness of the test, when censoring mechanisms are not equal.Survival times T A,i in the control group A were generated according to a Weibull distribution i.e.Λ A (s) ≔ − log(s) � t κ with prefixed shape parameter κ 2 {0.1, 0.25, 0.5, 1.0, 1.5, 2.0, 5.0} and 1-year survival rate S 1 ≔ S T A ð1Þ ≔ 0:5.To implement the PH condition, survival times T B, i in the experimental intervention group B were generated according to a Weibull distribution i.e.Λ B (s)2/3 � Λ A (s), fixing ω = 2/3 as the true hazard ratio such that S T B ð1Þ ≔ 0:5 2=3 � 0:63.Additional simulation results with ω = 1/2 and ω = 4/5 will be presented in the S1 Appendix.
We let the number of patients in the historic group n A 2 {200, 400, 800} vary between scenarios and used our sample size equation to determine the required number of patients in the intervention group n B for a targeted power of 1 − β = 80% for our new test under planning alternative K 0 : Λ B = 2/3 � Λ A for a two-sided significance level of α = 5%.From these parameter constellations we omitted any scenario with π > 2 since these might be unrealistic in practical use of the methodology.
For each parameter constellation, we generated 10000 samples to which we applied both the new test as well as the classical two-sample log-rank test.Reported in Tables 1 and 2 are the empirical type I and type II error rates for each parameter constellation.To increase readability of the tables we omitted potential confidence intervals for our estimators, but notice that with 10000 samples potential 95%-confidence intervals for the type I error rate estimators given in Tables 1 and 2 should have a width of about 0.004.

Results
Reassuringly, our new test statistic keeps the significance level of α = 5%.Even in scenarios where the equal censoring mechanism assumption was moderately violated, the error rate inflation was negligible.The performance of the new test is comparable to the two-sample logrank test which is known to be asymptotically optimal under the proportional hazards assumption.In scenarios with early events (Weibull shape parameter �1.5) the performance was even slightly better.The sample size calculation algorithm, however had slightly mixed results.Many scenarios achieved the targeted power of 80% quite accurately.However, in some extreme scenarios with particularly small sample sizes and extreme shape parameters of the reference survival-curve, the sample size calculation yielded underpowered trials.For these combinations, we recommend validating the sample size by simulation and comparing the sample size to a two-armed trial using the two-sample log-rank test.

Discussion
Traditional one-sample log-rank tests compare the survival of an experimental treatment to a prefixed reference survival curve, which typically represents the expected survival under standard of care.The choice of the reference survival curve is often based on historic data on standard therapy and thus prone to statistical error.Nevertheless, traditional one-sample log-rank tests do not account for this variance of the reference curve estimator.This non-consideration of the sampling variability leads to an inflation of the type I error rate.The extent of this  As expected, the calculated sample sizes are of similar size as the sample sizes obtained through two-sample planning methodology like the Schoenfeld formula [9].Traditionally, one-sample methods are popular because they require a small sample size: By using a one-sample log-rank test, not only the sample size of the control arm is saved, but also the reference curve is considered deterministic, thereby diminishing the variability of the test statistic.This again relevantly reduces the required sample size.However, if the reference curve itself is also estimated from data the assumption of a deterministic reference curve does not hold.Then the latter reduction of sample size comes at the price of an inflated type I error rate if not addressed by adequate statical considerations [2].To obtain a valid test procedure the variability has to be accounted for, as our new proposed survival test does.
In summary we advise to use our new test methodology if the reference curve is obtained through an estimation process and access to the full historic data is restricted.It is important to note, that the usual use case of the two-sample log-rank test is a randomised controlled trial which fundamentally differs from the case of restricted reference data which is by design unrandomised and leaves no room for adjustment techniques of confounders since individual patient level data are not available.Therefore it is crucial to exercise extreme caution when using a historical control from a summarized Kaplan Meier survival curve.Thus application of the new test may lead to significant differences between the compared survival curves, but it may be unclear whether this difference stems from treatment or confounding.For this reason the methodology is suitable for phase II trials but not for phase III trials as a basis for drug approval or health technology assessment decision making.In addition, when choosing the reference curve, it must be ensured that the recruitment and censoring mechanisms of the historical cohort are equal to those of the new cohort in order to guarantee an exact type I error rate control.Therefore, we advise to use the classical two-sample log rank test in case of individual patient level survival data availability, since it does not rely on the assumption of equal recruitment and censoring mechanisms.Nevertheless, the new test has a similar performance as the two-sample log rank test which is known to be optimal under proportional hazards.Furthermore it should be noted that the new test methodology can be used with existing standard software, as the approach is equivalent to an adjustment of the nominal α-level in the standard one-sample log-rank test.
Fig 1 is obtained from data of the "AOK-Die Gesundheitskasse" (AOK), a collective of 11 local German health insurance funds covering one third of the German population.

Fig 1 .
Fig 1. Reference survival curve.Kaplan Meier estimator b S T A of overall survival of patients hospitalized with a myocardial infarction in 2014 aged 30-60, without obesity and without a prior indication of chronic heart failure.https://doi.org/10.1371/journal.pone.0305434.g001

Fig 2 .
Fig 2. Overall survival of both collectives.Kaplan Meier estimators b S T A and b S T B of overall survival of patients hospitalized with a myocardial infarction in 2014 aged 30-60, without a prior indication of chronic heart failure.https://doi.org/10.1371/journal.pone.0305434.g002 pÞ ðsee Eq: ð13Þ and Appendix B in ½2�Þ; if recruitment and censoring mechanism were equal in the new treatment group B and the historical group underlying the estimate b L A .This motivates us to propose the following random variable as a new survival test statistic